AITopics | data error

Collaborating Authors

data error

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Stress-Testing ML Pipelines with Adversarial Data Corruption

Zhu, Jiongli, Xu, Geyang, Lorenzi, Felipe, Glavic, Boris, Salimi, Babak

arXiv.org Artificial IntelligenceJun-3-2025

Structured data-quality issues, such as missing values correlated with demographics, culturally biased labels, or systemic selection biases, routinely degrade the reliability of machine-learning pipelines. Regulators now increasingly demand evidence that high-stakes systems can withstand these realistic, interdependent errors, yet current robustness evaluations typically use random or overly simplistic corruptions, leaving worst-case scenarios unexplored. We introduce SAVAGE, a causally inspired framework that (i) formally models realistic data-quality issues through dependency graphs and flexible corruption templates, and (ii) systematically discovers corruption patterns that maximally degrade a target performance metric. SAVAGE employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity, treating the full ML pipeline, including preprocessing and potentially non-differentiable models, as a black box. Extensive experiments across multiple datasets and ML tasks (data cleaning, fairness-aware learning, uncertainty quantification) demonstrate that even a small fraction (around 5 %) of structured corruptions identified by SAVAGE severely impacts model performance, far exceeding random or manually crafted errors, and invalidating core assumptions of existing techniques. Thus, SAVAGE provides a practical tool for rigorous pipeline stress-testing, a benchmark for evaluating robustness methods, and actionable guidance for designing more resilient data workflows.

artificial intelligence, data quality, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2506.0123

Country: North America > United States > California (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (0.67)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
(2 more...)

Add feedback

Tradeoff of generalization error in unsupervised learning

Kim, Gilhan, Lee, Hojun, Jo, Junghyo, Baek, Yongjoo

arXiv.org Artificial IntelligenceSep-12-2023

Finding the optimal model complexity that minimizes the generalization error (GE) is a key issue of machine learning. For the conventional supervised learning, this task typically involves the bias-variance tradeoff: lowering the bias by making the model more complex entails an increase in the variance. Meanwhile, little has been studied about whether the same tradeoff exists for unsupervised learning. In this study, we propose that unsupervised learning generally exhibits a two-component tradeoff of the GE, namely the model error and the data error -- using a more complex model reduces the model error at the cost of the data error, with the data error playing a more significant role for a smaller training dataset. This is corroborated by training the restricted Boltzmann machine to generate the configurations of the two-dimensional Ising model at a given temperature and the totally asymmetric simple exclusion process with given entry and exit rates. Our results also indicate that the optimal model tends to be more complex when the data to be learned are more complex.

configuration, generalization error, ising model, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1088/1742-5468/ace42c

2303.05718

Country:

Asia > South Korea > Seoul > Seoul (0.05)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Add feedback

DataVinci: Learning Syntactic and Semantic String Repairs

Singh, Mukul, Cambronero, José, Gulwani, Sumit, Le, Vu, Negreanu, Carina, Verbruggen, Gust

arXiv.org Artificial IntelligenceAug-21-2023

String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Systems that successfully clean such string data can have a significant impact on real users. While prior work has explored errors in string data, proposed approaches have often been limited to error detection or require that the user provide annotations, examples, or constraints to fix the errors. Furthermore, these systems have focused independently on syntactic errors or semantic errors in strings, but ignore that strings often contain both syntactic and semantic substrings. We introduce DataVinci, a fully unsupervised string data error detection and repair system. DataVinci learns regular-expression-based patterns that cover a majority of values in a column and reports values that do not satisfy such patterns as data errors. DataVinci can automatically derive edits to the data error based on the majority patterns and constraints learned over other columns without the need for further user interaction. To handle strings with both syntactic and semantic substrings, DataVinci uses an LLM to abstract (and re-concretize) portions of strings that are semantic prior to learning majority patterns and deriving edits. Because not all data can result in majority patterns, DataVinci leverages execution information from an existing program (which reads the target data) to identify and correct data repairs that would not otherwise be identified. DataVinci outperforms 7 baselines on both error detection and repair when evaluated on 4 existing and new benchmarks.

large language model, machine learning, natural language, (23 more...)

arXiv.org Artificial Intelligence

2308.10922

Country:

North America > United States > New York (0.04)
North America > United States > Nevada (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(4 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Software (1.00)
Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Add feedback

Contrastive Error Attribution for Finetuned Language Models

Ladhak, Faisal, Durmus, Esin, Hashimoto, Tatsunori

arXiv.org Artificial IntelligenceJul-11-2023

Recent work has identified noisy and misannotated data as a core cause of hallucinations and unfaithful outputs in Natural Language Generation (NLG) tasks. Consequently, identifying and removing these examples is a key open challenge in creating reliable NLG systems. In this work, we introduce a framework to identify and remove low-quality training instances that lead to undesirable outputs, such as faithfulness errors in text summarization. We show that existing approaches for error tracing, such as gradient-based influence measures, do not perform reliably for detecting faithfulness errors in NLG datasets. We overcome the drawbacks of existing error tracing methods through a new, contrast-based estimate that compares undesired generations to human-corrected outputs. Our proposed method can achieve a mean average precision of 0.93 at detecting known data errors across synthetic tasks with known ground truth, substantially outperforming existing approaches. Using this approach and re-training models on cleaned data leads to a 70% reduction in entity hallucinations on the NYT dataset and a 55% reduction in semantic errors on the E2E dataset.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2212.10722

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > China (0.05)
Europe > United Kingdom > Wales (0.05)
(11 more...)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Sports > Baseball (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.90)

Add feedback

Computed Tomography Reconstruction Using Deep Image Prior and Learned Reconstruction Methods

Baguer, Daniel Otero, Leuschner, Johannes, Schmidt, Maximilian

arXiv.org Machine LearningMar-12-2020

In this work, we investigate the application of deep learning methods for computed tomography in the context of having a low-data regime. As motivation, we review some of the existing approaches and obtain quantitative results after training them with different amounts of data. We find that the learned primal-dual has an outstanding performance in terms of reconstruction quality and data efficiency. However, in general, end-to-end learned methods have two issues: a) lack of classical guarantees in inverse problems and b) lack of generalization when not trained with enough data. To overcome these issues, we bring in the deep image prior approach in combination with classical regularization. The proposed methods improve the state-of-the-art results in the low data-regime.

data error, psnr, ssim, (16 more...)

arXiv.org Machine Learning

2003.04989

Country:

Europe > Germany > Bremen > Bremen (0.28)
Europe > Germany > Hesse > Darmstadt Region > Wiesbaden (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
Europe > Netherlands > South Holland > Dordrecht (0.04)

Genre:

Research Report (0.64)
Overview (0.48)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Model Selection for Simulator-based Statistical Models: A Kernel Approach

Kajihara, Takafumi, Kanagawa, Motonobu, Nakaguchi, Yuuki, Khandelwal, Kanishka, Fukumiziu, Kenji

arXiv.org Machine LearningFeb-7-2019

We propose a novel approach to model selection for simulator-based statistical models. The proposed approach defines a mixture of candidate models, and then iteratively updates the weight coefficients for those models as well as the parameters in each model simultaneously; this is done by recursively applying Bayes' rule, using the recently proposed kernel recursive ABC algorithm. The practical advantage of the method is that it can be used even when a modeler lacks appropriate prior knowledge about the parameters in each model. We demonstrate the effectiveness of the proposed approach with a number of experiments, including model selection for dynamical systems in ecology and epidemiology.

coefficient, experiment, model selection, (16 more...)

arXiv.org Machine Learning

1902.02517

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.05)
Asia > Middle East > Lebanon (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Add feedback

Bayesian Model Comparison and Backprop Nets

MacKay, David J. C.

Neural Information Processing SystemsDec-31-1992

The Bayesian model comparison framework is reviewed, and the Bayesian Occam's razor is explained. This framework can be applied to feedforward networks, making possible (1) objective comparisons between solutions using alternative network architectures; (2) objective choice of magnitude and type of weight decay terms; (3) quantified estimates of the error bars on network parameters and on network output. The framework also generates a measure of the effective number of parameters determined by the data. The relationship of Bayesian model comparison to recent work on prediction of generalisation ability (Guyon et al., 1992, Moody, 1992) is discussed.

error bar, inference, occam factor, (12 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Pasadena (0.04)
Europe > United Kingdom (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Add feedback

Bayesian Model Comparison and Backprop Nets

MacKay, David J. C.

Neural Information Processing SystemsDec-31-1992

error bar, inference, occam factor, (12 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Pasadena (0.04)
Europe > United Kingdom (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Add feedback

Bayesian Model Comparison and Backprop Nets

MacKay, David J. C.

Neural Information Processing SystemsDec-31-1992

The Bayesian model comparison framework is reviewed, and the Bayesian Occam's razor is explained. This framework can be applied to feedforward networks, making possible (1) objective comparisons between solutions using alternative network architectures; (2) objective choice of magnitude and type of weight decay terms; (3) quantified estimates of the error bars on network parameters and on network output. The framework also generates ameasure of the effective number of parameters determined by the data. The relationship of Bayesian model comparison to recent work on prediction ofgeneralisation ability (Guyon et al., 1992, Moody, 1992) is discussed.

artificial intelligence, machine learning, occam factor, (14 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.14)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Add feedback